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Abstract. In this paper I will outline some of the aspects and problems of modern ce- 
lestial mechanics and stellar dynamics, in the context of the quickly growing computing 
facilities. I will point the attention on the great advantages in using, for astrophysical simu- 
lations, the modem, fast and cheap Graphic Processing Units (GPUs) acting as true super- 
computers. Finally, I present and discuss some characteristics and performances of a new 
double-parallel code exploiting the joint power of multicore CPUs and GPUs. 

Key words. Celestial mechanics- Stellar dynamics- Methods: N-body simulations- 
Supercomputing 



1. Introduction 

The role of gravity in physics is, of course, fun- 
damental. Anyway, this role is completely dif- 
ferent in earth physics respect to that in astro- 
physics. 

In terrestrial physics gravity is not too diffi- 
cult to be accounted for, because it simply acts 
as an external constant field to add to other 
more complicated interaction among the con- 
stituents of the system under study. In other 
words, physical systems on earth are not self- 
gravitating, and this implies an enormous sim- 
plification. In an astrophysical context, things 
are different: astronomical objects are self- 
gravitating. Their shape, volume and dynamics 
are determined mainly by self-gravity, which 
acts, often, in conjunction with the external 
gravity due to the presence of either outer bod- 
ies or general potential where the object is em- 
bedded in. External gravity determines the or- 
bit of the astronomical body, and influences its 
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shape, at least in its outskirts, by mean of tidal 
interactions, as well. 

A simple quantitative parameter to measure 
the role of self-gravity to the whole energetics 
of a given system is the ratio, a, between the 
self-gravitation energy of the system and the 
energy given by the external gravitation field 
where the system is embedded in. For a typi- 
cal terrestrial system like, e.g., the Garda lake 
a ^ 10"*^, while for two, quite different, astro- 
nomical systems (a typical globular cluster in 
a galaxy and a typical galaxy in a galaxy clus- 
ter) a ^ 10"^: a million times greater Apart 
from the other, obvious, differences (a lake is 
composed by a liquid, where the collisional 
time scale is negligible respect to any other 
time scale in the system, while the globular 
cluster and the galaxy are composed by stars 
moving in volumes such that the collisional 2- 
body time scale is comparable, in the case of 
globular cluster, or much longer, in the case of 
galaxy, to the system orbital time and age), it 
is clear that while the lake molecules mutual 
gravitational interactions are negligible respect 
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to the external field, this is not the case for the 
stars in globular clusters or galaxies. 

2. A^-body systems in astrophysics 

As stated above, self-gravity cannot be ne- 
glected when studying the physics of astro- 
nomical objects. This makes theoretical astro- 
physics a hard field: the dynamics of astro- 
physical systems is intrinsecally difficult to be 
studied, even in newtonian approximation, be- 
cause of the double divergence of the, simple, 
two-body interaction potential, Uij oc l/r,j, 
where r,j is the euclidean distance between the 
i and j particle, 

nj = yjixi - xy)2 + (yi - jj)^ + {Zi - Zj)^. 
Ultra-violet divergence corresponds to very 
close encounters, infra-red divergence to that 
the gravitational interaction never vanishes. 
These divergences introduce a multiplicity of 
time scales (Aarseth 1985) and make impos- 
sible to rely on statistical mechanics and/or 
to non-perturbative methods, as often done in 
other particle-systems physics. Actually, the 
newtonian A^-body dynamics is mathemati- 
cally represented by the system of second- 
order differential equations 

< r;(0) = r,o, (1) 

r,(0) = r,o, 
iO-= 1,2,...,A'). 

This dynamical system is characterized by: (i) 
0(N^) complexity, (ii) being far from linear- 
ity, (iii) having few constraints in the phase- 
space. Sundman (1912) showed (without win- 
ning the King Oscar II Prize, already awarded 
to H. Poincare...) that for the three-body prob- 
lem there is a series solution for the coordinates 
in powers of f ' convergent for all f , except ini- 
tial data which correspond to zero angular mo- 
mentum. This result was generalized to any A^ 
just in relatively recent times by Wang (1991). 
Anyway, the power series solutions are so slow 
in convergence to be useless for practical use. 
This means that the gravitational A^-body prob- 
lem must be attacked numerically. 
The difl&culties in doing this are, both, theo- 
retical and practical. On the theoretical point 



of view, one has to face with the chaotic be- 
haviour of the nonlinear system which is re- 
lated to the extreme sensitivity of the system's 
differential equations to the initial conditions: 
a very small initial difference may result in an 
enormous change in the long-term behaviour 
of the system. Celestial dynamics gives, in- 
deed, one of the oldest examples of chaos in 
physics. This problem is almost unsolvable; it 
may be just kept under some control by using 
sophisticated, high order time integration al- 
gorithms. On the practical side, the (obvious) 
greatest complication to face is the due to the 
infrared (large scale) divergence, that implies 
the need of computing all the oc A'^ force inter- 
actions between the pairs in the systems. This 
results in an extremely demanding computa- 
tional task, when A' is large (see Table I). We 
will now discuss some of the problems arising 
when dealing with the numerical study of the 
evolution of self-gravitating systems over the 
astronomical range of A'^. 

3. Small- and Large- systems: from 
celestial mechanics to stellar 
dynamics 

On the small- A^ side (A'^ < 10, example: so- 
lar system) the problem is not that of enor- 
mous CPU time consumption, for the number 
of pairs is small, but, rather, that of the need of 
an enormous precision. This to keep the round- 
off error within acceptable bounds when inte- 
grating over many orbital times. In the case of 
few bodies, reliable investigations cannot ac- 
cept the point mass scheme (for instance, the 
Sun potential requires a multipole expansion) 
and high precision codes are compulsory. Pair 
force evaluation is computationally cheap due 
to the low number of pairs; on the other side, 
even very small round-off errors increase secu- 
larly, time step by time step, making high order 
symplectic integration algorithms unavoidable. 
The need is: a fast computer, able to handle 
with motion integration over a very extended 
time and able to evaluate forces with enormous 
precision. 

We do not speak any further of the few body 
regime, which is the realm of modern celes- 
tial mechanics and space dynamics, but go to 
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Table 1. Some typical astronomical systems, with their star number (N), number of floating point 
operations needed for the force evaluations in a single system configuration (ny) and CPU time 
required to the n/ operations by a single processor of 1 Gflops speed (tcpt/, in seconds). Note 
that 1.8 xlO'4 sec^5.7Myr! 



system 


A' 


«/ 


^CPU 


Open cluster 


1000 


1.5 X 10' 


0.02 


Globular cluster 


Itf" 


1.5 X W 


180 


Galaxy 


lO'i 


1.5 X 10^^ 


1.8 X 10'* 



say something on the problem of intermediate- 
and large-N-body systems, task which is typ- 
ical of the modern stellar dynamics, instead. 
Force computation by pairs is computation- 
ally expensive, the mostly demanding part be- 
ing the evaluation of the distance rjj between 
the generic ; and j particle. It requires the 
computation of a square root which, still with 
modem computers, is based on ancient meth- 
ods among which the Erone's method, the 
Bombelli's method and the Newton-Raphson 
numerical solution of the quadratic equation 

- rjj = 0. In any case, the single pair force 
evaluation requires about 30 floating point op- 
erations; this means that in an N-body system, 
nf = 30 X N(N - l)/2 floating point opera- 
tions are required. A single processor (PE) with 
a speed of 1 Gflops would compute the single 
pair force in ~ 3 x 10"^ sec. Consequently, the 
whole N star forces would require the time in- 
dicated in Table I for their evaluation at every 
time step. Clearly, the task of following nu- 
merically the long term evolution of a large- 
N-body system by a program based on direct 
summation of pair forces is very far out of the 
capability even of the most performing com- 
puters. Actually, the profiUng of any computer 
code to integrate N-body evolution indicates 
that about 70% of the CPU time is spent in 
force evaluation. 

What strategies must be used, then? 

The most natural way to attack the prob- 
lem is a proper combination of the following 
ingredients: (i) simplification of the interaction 
force calculation; (ii) reduction of the number 
of times that the forces have to be evaluated, 
by a proper variation of the time step both in 
space and in time; (iii) use of the most power- 



ful (parallel) computers available. 

Points (i) and (ii) require a deep effort of nu- 
merical analysis, point (iii) requires the solu- 
tion of the, not easy, problem of parallelizing 
an N-body code. 

The simpUfication of force calculation may 
be done by means of the introduction of 
space grids, both for computing the large 
scale component of the gravitational force 
via the solution of the Poisson's equation 
(with Fast-Fourier codes, for example) and for 
the dynamic subdivision of the space domain 
with a recursive, octal tree to take computa- 
tional advantage by a multipole expansion of 
the interaction potential (approach first used 
by Barnes & Hut 1986). These are two of 
the possibilities to reduce the particle-particle 
(PP) force evaluation to a particle-mesh (PM) 
or particle-particle-particle-mesh (P3M) ap- 
proach, with obvious computational advan- 
tages (see Hockney & Eastwood 1988 for a 
general discussion). In addition to the com- 
plications introduced in the computer code, a 
clear Umit of this procedure is the error intro- 
duced in the force evaluation, which can be re- 
duced, over the small scale, by keeping a direct 
PP force evaluation for close neighbours. Point 
(ii), time stepping variation, relies mainly on 
the use of individual (per particle) time steps. 
Particles are advanced with a time step proper 
to the individual acceleration felt, allowing a 
reduction in highly dynamical cases without 
stopping the overall calculation. Unfortunately, 
individual time stepping requires careful im- 
plementation to guarantee synchronous inte- 
gration and implies, often, a reduction of order 
of precision of the integration method. Finally, 
the parallelization of gravitational codes (point 
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(iii)) is difficult, because gravity is such that 
the force on every particle depends on the po- 
sition of all the others. This makes non triv- 
ial a domain decomposition such to release a 
balanced computational weight to the various 
PEs of a parallel machine. In this context, it 
is relevant noting that many active groups of 
research chose to use 'dedicated' parallel ar- 
chitectures, which act as boosters of specific 
computations, like those of the distances be- 
tween particles. This is the road opened by 
the Japanese GRAPE group lead by Makino 
(Makino 1991). Another, intriguing, possibil- 
ity is the use of Graphic Processing Units 
(GPUs) as cheap alternatives to dedicated sys- 
tems. GPUs are used to speed up force compu- 
tations and give high computing performances 
at much lower costs, especially in cases where 
double precision is not required. This is the 
choice explored in astrophysics first by S. 
Portegies Zwart and his dutch group (Portegies 
Zwart, Belleman & Geldof 2007). Capuzzo- 
Dolcetta and collaborators in Italy (Capuzzo- 
Dolcetta, Maschietti & Mastrobuono-Battisti 
2009) have constructed a direct N-body code 
implementing sophisticated 2"^ and 6* or- 
der symplectic time-integration and using as 
force evaluation accelerator a pair of brand 
new NVIDIA TESLA C1060 Graphic process- 
ing Units (GPUs) programmed by means of 
the native NVIDIA Compute Unified Device 
Architecture (CUDA, see 
www.nvidia. com/object/cudaJiome. html) . 

4. The NBSymple code 

The code generates, first, the initial condi- 
tions for the A^-body system, whose individual 
masses may be chosen by a given mass spec- 
trum. The total mass of the system, M, is as- 
sumed as mass unit. For the sake of simplic- 
ity, aiming first at checking quality of integra- 
tion and at performances testing, particles were 
given an initial spatially uniform distribution 
within a sphere of given (unitary) radius, R, 
with velocities, also, uniformly distributed in 
direction and absolute values and rescaled, in 
their magnitude, to reproduce a given value of 
the virial ratio. We remind that the virial ra- 
tio is defined as Q = 2K/\n\, where K and Q 



are, respectively, the system kinetic and poten- 
tial energies; for a stationary system, 2=1- 
Note that the further assumption G = 1 in the 
equations of motion implies that the 'cross- 
ing' timer = (GM)'^^/?^/^ is the unit of time. 

The code allow the introduction of a soften- 
ing parameter (e) in the star-star interaction po- 
tential, usually taken as a fraction of the clos- 
est neighbour average distance. The pairwise 
forces are summed to the force due to the ex- 
ternal field, which is accounted by an analytical 
expression for the Galactic potential as given 
by Allen & Santillan (1991). In this latter work 
the authors consider the Galactic potential as 
given by three components: a bulge, a disk and 
a halo. The bulge and the halo have a spherical 
symmetry, while the disk is axisymmetric. 
Any kind of generalization to diferent sets of 
initial conditions and external potentials is easy 
done by mean of appropriate external subrou- 
tines provided by the user. 

4.1. Time integration 

It is well known that ordinary numerical 
methods for integrating Newtonian equations 
of motions become dissipative and exhibit 
incorrect long term behaviour. This is a serious 
problem when facing A'-body problems, 
particularly when studying their long term 
evolution. One possibiUty is to use symplectic 
integrators. Symplectic integrators are nu- 
merical integration schemes for Hamiltonian 
systems, which conserve the symplectic 
two-form dp A dq exactly, so that (q(0), p(0)) 
(q(T), p(t)) is a canonical transformation. 
The transformation is characterized by time 
reversibility. If the integrator is not symplectic, 
the error of the total energy grows secularly, 
in general. Our code allows the choice of 
two different symplectic methods. One is the 
simple, classic 'leapfrog' method, which is 
second order accurate; the other is a more 
accurate sixth order expUcit scheme whose 
coefficients are taken from the first column of 
the Table 1 of Kinoshita, Yoshida & Nakai 
(1991), which leads to a time integration 
conserving energy much better than that with 
the other two possible sets of coefficients in the 
Tableg. Of course, the 6-th order symplectic 
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integrator is much slower than the leap frog, 
requiring 7 evaluations of force functions per 
time step, like, for instance, in a 6-th order 
Runge Kutta method). 



4.2. The computing platform 

The workstation used to test and run our 
NBSymple code has a 2 Quad Core Intel Xeon 
processors, each running at l.OQGHz, 4GB 
DDR2 RAM at 661MHz and two NVIDIA 
TESLA CI 060 CPUs, connected to the host 
via two slots PCI-E 16x. 
NVIDIA TESLA CI 060 has 240 processors, 
each of them has a clock of 1 .296 GHz. 

5. Results 

Accurate testing of both the quality of 
the A^-body system integration and of the 
computational efficiency of NBSymple is 
given in the Capuzzo-Dolcetta, Maschietti & 
Mastrobuono-Battisti (2009) paper. In that pa- 
per, the various versions of the code are pre- 
sented and discussed. Some versions work 
in single-precision arithmetics (exploiting at 
best the GPU performances but not fully sat- 
isfactory in terms of the precision) and in 
both harwdware (slower, more precise) and 
software (faster, less precise) double-precision 
arithmetics. The software double-precision is 
implemented following Goburov, Harfst & 
Portegies Zwart (2009). 
The NBSymple code has presently 5 versions, 
each labeled with an alphabetic letter from A 
toE: 

- NBSympleA: fully serial code running on 
a single Quad core processor; 

- NBSympleB: single-parallel code which 
uses Open Multi-Processing (OpenMP) di- 
rectives, for both the O(N^) pairwise in- 
teractions and the 0(N) calculations (i.e. 
the time integration and evaluation of the 
Galactic component of the force on the sys- 
tem stars) over the double Quad core host; 

- NBSympleC: single-parallel code, where 
the {0{N^) all-pairs interactions calcu- 
lations)are demanded to the NVIDIA 



TESLA CI 060 GPU, using CUDA while 
all the remaining tasks are done by a single 
Quad core CPU; 

- NBSympleD: double-parallel code, which 
again uses CUDA to evaluate the 0{N^) 
portion of the code (as NBSympleC), while 
the 0{N) computations is parallelized shar- 
ing work between all the eight cores of the 
host, using OpenMP, as NBSympleB; 

- NBSympleE: single-parallel code that uses 
CUDA on one or two GPUs to evaluate the 
total force over the system stars, i.e. both 
the all-pairs component and that due to the 
Galaxy. 

We emphasize that the pairwise interac- 
tions evaluation are developed, in the CUDA 
framework, following mainly the Nyland, 
Harris & Prins (2007) work. 
Here I just present a figure (Fig.[T]l showing a 
comparison of the time spent (in seconds) by 
various versions of the NBSymple code for a 
single time step integration of an A^-body sys- 
tem as function of the number of bodies. 
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Fig. 1. The (averaged over 1000 cycles) solar time (in seconds) spent by various versions of 
NBSymple to perform a single integration step, as a function of A'. Line with empty squares: 
NBSympleA code. Line with filled triangles: NBSympleB. Line with crosses: NBSympleC. Line 
with filled squares: NBSympleD. Line with empty triangles: NBSympleE with a single GPU. 
Line with stars: NBSympleE with two GPUs. 
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